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[57] ABSTRACT 

A collection of equivalence rules involving the multiset 
version of the relational algebraic theta-semijoin operator is 
used to generate relational algebraic expressions equivalent 
to a computer programming language query. These expres- 
sions may be employed as a search space which is utilized 
by, for example, optimizing software or software that deter- 
mines the equivalency of queries. Cost formulas for the 
multiset version of the theta-semijoin operator may be used 
in computing cost estimates for the generated expressions. 
Based on these computed cost estimates, the least costly 
implementation of a complex query is determined. Thus, 
queries are cost-based optimized on both a local and global 
basis by use of the relational algebraic theta-semijoin opera- 
tor. 
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OPTIMIZATION OF QUERIES USING 
RELATIONAL ALGEBRAIC THETA- 
SEMUOIN OPERATOR 

The present application claims priority to co-pending 
U.S. Provisional application Ser. No. 60/018,391 which was 
filed on May 29, 1996. 

FIELD OF THE INVENTION 

The invention relates generally to database management 
systems and in particular to the optimization of SQL queries. 

BACKGROUND OF THE INVENTION 

Relational database management systems are well-known 
in the art. In a relational database, information is structured 
in a collection of tables in which data values are stored in 
rows under various column headings. The Structured Query 
Language ("SQL") allows users to access databases main- 
tained under any number of relational database management 
systems and has become the standard for relational database 
access. 

Data is retrieved from the relational database by means of 
a SQL query, such as, in particular, a so-called SQL 
"SELECT' statement. A simple SQL SELECT statement 
may be of the form 

SELECT specified field(s) 

FROM specified table(s) 

WHERE specified coDditioQ(s) is true. 
For example, the query 

SELECT name 

FROM employees 

WHERE sal=100 
results in a list of the names of those employees earning 
$100, where "employees" is a table defined to include 
information about all employees of a particular company. 

Other operations may be specified in, or result fix)m, a 
SQL query. Some examples are as follows. Data from two or 
more tables may be combined in a "join" operation, "Views" 
can be derived from one or more so-called "base tables." 
Aggregates, e.g., such operators as SUM and COUNT, 
specify operations to be performed on the collection of 
values in some column of a table. The GROUP BY operator 
allows for tables to be grouped by any combination of their 
fields. Finally, SELECT statements may also be nested, 
thereby forming different types of subquerics. 

Since any combination of such SQL operations as those 
above may be found in one SQL query, a SQL query may 
become quite complex, and, in fact, this complexity has 
increased as SQL queries have evolved over time. In 
particular, simple queries are typically "one block** queries, 
that is, they can be expressed with one SELECT statement 
having single FROM, WHERE, HAVING, and/or 
GROUPBY clauses. Simple queries have no subqueries or 
views. In contrast, a complex SQL query is composed of 
multiple blocks. An example of a complex SQL query is the 
so-called "decision-support** queries. Organizations have 
come to base decisions on results from these queries which 
are often defined using grouping/aggregation view relations 
and correlated subqueries (i.e., a subquery which is depen- 
dent upon some variable(s) whose value is determined in an 
"outer" query). 

SQL queries express what results are requested but do not 
state how the results should be obtained. In other words, the 
query itself does not tell how the query should be evaluated 
by the relational data base management system. Rather, a 
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component called the optimizer determines the "plan" or the 
best method — ^for example, in terms of I/O and CPU pro- 
cessing costs — of accessing the data to implement the SQL 
query. 

5 Because of the potential complexity of SQL queries, 
query optimization, especially with respect to decision- 
support queries, has become very important. Different 
approaches to decision-support query optimization include 
the use of relational algebra and "magic sets rewriting". 
Relational Algebra 

Translating simple SQL queries into relational algebraic 
expressions is a well-known optimization technique. Gen- 
erally speaking, a query is received by a database manage- 
ment system either interactively from a user or from a 
program in which the query is embedded The optimizer or 
optimizing portion of the database management system 
either translates the query into a relational algebraic expres- 
sion or receives the already-translated relational algebraic 
expression from another component of the database man- 
agement system. In either case, once the SQL query is in the 

20 form of a relational algebraic expression, so-called "equiva- 
lence rules** transform the expression into other equivalent 
algebraic expressions, thereby generating a "search space" 
or "space", i.e., the number of different ahernative imple- 
mentations that an optimizer will consider. 

25 f* j 3nce the search space is generated, cost estimates for 
each algebraic expression can be generated by utilizing the 
cost formulas for the relational algebraic operators and the 
different ways of evaluating these operators. The estimated 
least costly alternative is then chosen as the plan. For 

30 example, a join of two relations (or tables) may be imple- 
mented by choosing one relation to be the "outer** relation 
and, for each tuple (or row) of that outer relation, finding all 
matching tuples of the other relation (called the "inner" 
relation). These matching tuples are then concatenated to the 

35 tuple of the outer relation. Although the actual cost for the 
join depends on the particular database system, determining 
the outer and inner relations, or using other methods to 
implement the join, may affect the estimated cost of per- 
forming the \om>j 

40 Variations of "the above technique can be used for the 
optimization of complex queries. For example, in one 
variation, a complex SQL query is broken into smaller 
blocks. These blocks are then translated to relational alge- 
braic expressions to which the equivalence rules and the 

45 above procedure are applied. The result is that, for each 
block, the "optimal" alternative is determined. This is 
referred to as "local" optimization. However, the optimiza- 
tion of the interaction between the blocks, the so-called 
"global" optimization, is performed on an ad-hoc basis 

50 outside of the relational algebra framework. 

^e relational operator called the semijoin operator has 
been used in the prior art to optimize simple distributed 
queries for set sema ntics ( i.e., queries whose results include 

5^ no duplicate values)l5n particular, it is used to optimize joins 

55 of database relations^ distributed database systems. Joins 
in distributed database systems are potentiajlyjcostly^pe ra- 
tions Secause 'in such system s'lhe datT to be . join&^l^is 
residcntlitTlifferedtliites, thereby incurring comm unic ation 
costs' as" well"as"pfocessin gncdstsrB y'fi rst'""^ a 

60 semijoin,* thT^ocessing_site_send_s_join-informatiDn_to_a 
receiving'site~ana_only t he d^ta which would in^f act-ioia-is — 
detefmine"d~at" tlie receiving site _^3inceJe ss data is t ransmit- 
tecI'ffom'theTeceiving site to the processing site, the costs of 

(i^tfCoHmiinicatingHhe^datOcOfie^ 

65 Magic Sets Rewriting 

To further improve the optimization process, the tech- 
nique called "magic sets rewriting" is tised to increase the 
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search space. Magic sets rewriting optimizes complex SQL equality conditions in SQL queries. We have discovered, 

queries, such as view definitions and nested subqueries, by however, that there need no longer be these restrictions, 

rewriting the queries more efficiently. Generally, the magic Rather, we have recognized that non-equality, as well as 

sets rewriting approach is to define a set of auxiliary "magic" equality, conditions in multiset SQL queries or subqueries 

(or "filter") relations that arc used to filter out irrelevant or 5 can be fully addressed in query optimization, 

repetitive data that does not contribute to the results of the Moreover, the rule-based framework of the invention 

queries, for example, data which would not be used by allows for its easy incorporation into existing cost-based 

subqueries. The most generalized form of magic set optimizer generators that, given a collection of equivalence 

rewritings, called Constraint Magic rewriting, can handle niles, generate an optimizer for a specific algebra (for 

non-equality conditions or predicates in queries, as well as 10 example, the University of Colorado's so-called "Volcano"), 

equality predicates. Further, the principles of this invention are also applicable 

For any one query, there may be many different alterna- technique of magic sets rewriting because magic sets 
tive rewritings. Generally, one or more of the rewntmgs are ^^^^y ^ jnodeled by the relational algebraic operators pro- 
selected heunsUcaUy as those likely to have lower process- ^-^^ invention. As noted above, the magic set 
mg costs. The cost of processing the selected rewnUngs is 15 rewriting approach was heuristic in that, from among vari- 
compared with the cost of proccssmg the query without the ^^^^ implementing magic sets, only one was arbi- 
magic set rcwntes, eventually choosmg the least costly way. ^^^^^ contrast, the equivalence rules of the 
Although there have been recent efforts to provide cost- ^^^^^ invention may be appUed to any group of magic set 
based techniques for selecting the most cost-effective rewnt- rewritings such that the optimal implementation may be 
ing (e.g., modeling magic sets rewntmg as a special jom 20 determined on a cost basis. Related to this concept is the 
method), magic sets rewriting generally remains a heuristic invention discussed in our co-pending U.S. patent 
technique, with only a minimal cost-based component. application. Sen No. 08/862,906 in pending entitled "Effect- 
SUMMARY OF THE INVENTION ing Constraint Magic Rewriting With the Theta-Semijoin 

According to the principles of the present invention, a ^5 ^pe^ator and filed of even date herewith, in which Con- 
collection of equivalence (or transformation) rules involving Magic rewriting (an extension of Magic Set 
the theta-semijoin operator is used, in addition to equiva- ^ ^ffecUvely generated for non-recursive quenes. 
lence rules known in the prior art, to generate relational FinaUy, unlike magic set rewntmg, the mvention works at 
algebraic expressions that are equivalent to a SQL query. ttie relational algebraic level, not the level of the SQL 
These expressions may be employed as a search space which 30 syntax, and can therefore be used for any version of SQL or, 
is utilized by optimizing software or, alternatively, these indeed, any other language translatable to relational algebra, 
expressions may be utiUzed by other types of soft>varc to, for Thus the principles of this invention are not limited to SQL 
example, determine the equivalency of querie^urther, the queries but may be applied to queries of any language 
co st formula^ jiioy ided herein for the multiset version of the translatable to relational algebra. 

thjgr^SgSin^^ 35 BRiEp DESCRIPTION OF THE DRAWING 
for other multiset relatio nal algebra ic operato rSj to em- 

ciently compute cost estimate s lor the g enerated ex pres- FIG. 1 illustrates an optimization technique used in the 

sionsT^ased on these computed cost estimates, the least prior art. 

costly"implementation of a complex query is determinecfj pjQ 2 illustrates an embodiment of an optimization 

Thus, SQL queries are cost-based optimized on both a local 40 technique according to the principles of the invention, 

and global basis by use of the relational ^gebraic theta- 3 iu^trates a computer environment for use in 

scmijoin operator. Smce the semantics of SQL is based on accordance with the invention. 

multiset relations (where query results may include duph- .1 * . • n j * . *u «• • i f 

1 \ *L !*• * i-u \ c*u *u * ■• • FlG.4pictorially demonstrates the effective simulation of 

cate values), the multiset (bag) version of the theta-semijoin _ , . , i,, - 

, . ^ . V &/ J Constraint Magic rewntmg. 

operator is used. 45 & & 

The addition of the equivalence rules of the invention to DETAILED DESCRIPTION 

those known in the prior art results in a more extensive Overview 

collection of equivalence rules from which a much larger FIG. 1 illustrates a simplified view of one prior art SQL 

search space for the optimizer is generated. Although the query optimization technique. Upon receiving SQL query 

larger search space could result in an estimated less expen- 50 10, optimizer 12 breaks down the query into any number of 

sive plan being chosen by the optimizer, it could also "blocks" U indicated as B1-B4. These blocks are then 

increase the cost of the optimization itseff. Therefore, translated by a translation process 13 into relational alge- 

according to a feature of this invention, heuristics are braic expressions 14, indicated as R1-R4. A set of equiva- 

provided which, in preferred embodiments, restrict the lence rules 15 are applied to each expression thereby 

applicability in certain cases of the equivalence rules that 55 generating, for each expression, a search space of equivalent 

introduce the theta-semijoin operator. relational algebraic expressions 16. These search spaces are 

There are several advantages provided by the invention. indicated as S1-S4. Once each search space is generated. 

By using the theta-semijoin operator, embodiments of this known cost estimates 17 for the various relational algebraic 

invention provide for the optimization of certain queries that operators can be used to determine the optimal implemen- 

have not been optimized well using traditional optimization 60 tation 18 for that particular block, indicated as 01-04. 

techniques. In particular, these queries include decision Finally, the optimizer determines the optimal interaction 

support queries, decision support queries, queries with among blocks using certain heuristics 19, thereby producing 

aggregation views, queries with SELECT DISTINCT views, plan 20. The latter is the method of accessing the data that 

and queries with correlated subquery and outerjoins. Based will be used to implement the SQL query, 

on the disclosures in the prior art of which we are aware, the 5S FIG. 2 illustrates an embodiment of an optimization 

use of the semijoin operator has generally been restricted to technique according to the principles of the invention. A 

set semantics and, further, has been used to address only complex SQL query 10 is broken down into blocks 21, 
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which are then translated by a translation process 22 into 
relational algebraic expressions 23. In contrast to the prior 
art technique described above, the "interrelationships" 
among the blocks 21 are also considered (as explained more 
fully hereinbelow) and expressed by using the thela semi- 5 
join operator. A set of equivalence rules consisting of the 
prior art equivalence rules 15 and the equivalence rules of 
the present invention 25 is used to generate search space 26. 
By considering not only blocks 21, but also the interrela- 
tionships among the blocks, in the relational algebraic 10 
ejqsressions comprising search space 26, cost-based local 
and global optimizations are possible. Further, unlike the 
prior art technique of FIG. 1, search space 26 may be 
comprised of alternative implementations of the SQL query 
that address complexities such as correlated subqueries with 15 
inequality conditions in the WHERE clause. 

Using the additional equivalence rules of the invention 
expands the search space considered by the optimizer from 
that which would have been generated in the prior art. 
Depending on the particular query, the consideration of a 20 
much larger search space may not be the most cost-effective 
technique. Therefore, a set of heuristics 24 is provided such 
that the use of the theta-semijoin operator will be restricted 
in certain circumstances. 

Once the search space 26 is generated, cost estimates of 25 
the prior art relational algebraic operators 17 and the theta- 
semijoin cost estimates of the invention 28 are utilized to 
determine the optimal method of evaluating the SQL query 
10, the plan 20. 

Computer Environment 30 

FIG. 3 illustrates an embodiment of a computer environ- 
ment in which the invention may be used. In particular, the 
computer hardware environment includes processor 30 to 
which one or more electronic storage devices 36 and display 
terminal 35 may be connected. Processor 30 having a 35 
memory supports a relational data base management system 
(RDBMS) software 32 comprised of various software com- 
ponents including optimizer 34. The relational data bases 
and other related components (such as logging datasets) may 
reside on electronic storage devices 36 connected to pro- 40 
cessor 30. SQL queries may be input to the relational data 
base management system 32 interactively from user at 
display terminal 35 or by the batch execution of a computer 
software program 33 having embedded SQL statements. The 
queries, stored in memory, are thereafter operated upon by 45 
the relational data base management system, in particular, by 
the optimnizer in accordance with the features of the inven- 

^The equivalence mles and cogtegtiniates provided by the 
invention may be stored in any physical medium, such as a 50 
computer diskette 37, for eventual input to optimizer gen- 
erator software 38. The optimizer generator software 38 in 
turn generates the appropriate optimizer given the collection 
of rules and estimates. Alternatively, the rules and estimates 
may be input directly to the data base management system 
optimizer if the optimizer is so-called "extensible" or 
designed to accept and utilize such inpuD 

The following sections provide an in-depth consideration 
of the invention. 
I. Illustrative Example 

As discussed earlier, in a relational database, information 
is stmctured in tables (or relations) in which data values are 
stored in rows (or tuples). A SQL SELECT statement is used 
to retrieve information from the database. The fields to be 
selected are specified in the SELECT clause by either the 
field name (e.g., fielda) or by a name in which the table name 
and field name are concatenated (e.g., tableB.fieldA). The 



latter form may be used if there is more than one table 
containing fielda referenced in the FROM clause. A key may 
be defined for a table and generally acts as an index to that 
table. 

To see how the use of the theta-semijoin transformations 
reduces the cost of answering a SQL query, consider the 
following example of an application where a supplier sup- 
plies many types of parts, each part having a given dollar 
value or cost (not to be confused with the above-discussed 
SQL query cost). The supplier has many warehouses. Each 
warehouse can store multiple types of parts, and each type 
of part can be stored in multiple warehouses. The relations 
are: Warehouse (Locld. Part, Quantity) and Parti nfo (Part, 
Cost), where the underlined attributes form a key. 

The supplier receives a set of requests for parts from a 
customer, as expressed by the relation Requests (Part, 
ReqQuanlity, PriceLimit). Each row or tuple in the relation 
indicates the part, the required quantity and the maximum 
price per part the customer is willing to pay for the requested 
part. To determine which of the customer requests can be 
satisfied by the parts available in the warehouses, the sup- 
plier asks the following query Ql: 



SEUECT Available. Part 
FROM Requests, Available 

WHERE Rcqucsts.Part = Availablc.Part AND Cost <» PriccUmit 
AND TotalOuantity >= ReqQuantity 



where the view relation Available, defined below, contains 
information about the cost and total quantity of each part 
available in all the warehouses. 



CREATE VIEW 


Available (Part, Cost, TotalQuantity) AS 


SELECT 


Warehouse.Part, Cost, SUM (Quantity) 


FROM 


Warehouse, E*artIafo 


WHERE 


Warehouse. Part = Parllofo.Part 


GROUPBY 


Warehouse. Part, Cost 



Let us suppose that the supplier supplies 10,000 types of 
parts, has 20 warehouses, and stores each type of part in 10 
warehouses on the average. Thus, the Warehouse relation 
has 100,000 tuples (i.e., 10 tuples for every type of part). The 
Partlnfo relation has 10,000 tuples (i.e., one tuple for every 
type of part). Let us further suppose that the customer 
requests 100 types of parts, only 10 of which satisfy the 
condition Partlnfo. Cost<=Requests.Pricc Limit, and that the 
suppUer can supply the required quantity of each type of 
part. The Requests relation therefore has 100 tuples. 

The relative eflSciency of evaluating the query in three 
cases can be compared: (a) the entire view Available is 
computed, and then joined with the Requests relation; (b) a 
view Al, obtained using magic sets is computed, and then 
joined with the 5 Requests relation; and (c) a view A2, 
obtained using theta-semijoin transformations, is computed, 
and then joined with the Requests relation. The comparison 
is performed using two metrics: (1) the number of tuples 
generated (as a result of joins, and grouping/aggregation), 
and (2) the number of tuples on which grouping/aggregation 
is performed. The greater the number of tuples in (1) and (2), 
the greater the processing cost associated with that particular 
method of evaluating the query. 

In case (a), computing the entire view Available involves 
a join of the 10,000 tuples Pardnfo relation with the 100,000 
tuples Warehouse relation, resulting in a relation with 100, 
000 tuples. A grouping and aggregation is performed on this 
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relation and the Available view has 10,000 tuples. The final 
join with the 100 tuple Requests relation results in comput- 
ing 10 answers to the query Ql. A total of 110,010 (100, 
000+10,000+10) tuples are generated, and 100,000 tuples 
are grouped/aggregated. 

In case (b), the magic sets rewriting optininization tech- 
nique calls for the computation of a view relation 
Query Available(Part) called a "magic" or "query" relation. 
This view relation (which basically is the "parts" of the 
Requests relation) is used to restrict the computation of the 
view Available to generate only tuples that would be answers 
to the correlated queries on view Available, based on the 
equality Requests.Part- Available. Part. Since the Requests 
relation has 100 tuples, the QueryAvailable relation has 100 
tuples as well. This is done by creating a specialized version 
Al of the view Available (which references the view relation 
QueryAvailable) and by modifying query Ql to use Al 
instead of view Available. 

It can be shown that computing view Al involves a join 
of the 100 tuples QueryAvailable relation with the 10,000 
tuples Partlnfo relation, and the resulting 100 tuples inter- 
mediate result is joined with the 100,000 tuples Warehouse 
relation, resulting in a relation with 1000 tuples, A grouping 
and aggregation is performed on this relation and the Al 
view relation has 100 tuples. The final join with the 100 
tuple Requests relation results in computing 10 answers to 
the query Ql. Atotal of 1210 (100+1000+100+10) tuples are 
generated, and 1000 tuples are grouped/aggregated. 

In case (c), the version A2 of the view Available is defined 
by using the theta-semijoin equivalence rules of the present 
invention. In contrast to the magic sets rewriting technique, 
the theta-semijoin equivalence rules consider the inequality 
condition '*Cost<=PriceLimit" in addition to the equality 
condition "Requests .price =Warehouse.price". Applying 
both these conditions further reduces the set of tuples 
computed for the view relation Available — there are less 
tuples in version A2 than in magic set version Al — thereby 
significantly reducing the cost of query answering. 

It can be shown that computing view A2 involves a join 
of the 100 tuples (projected) Requests relation with the 
10,000 tuples Partlnfo relation. The resulting 10 tuples 
intermediate result is joined with the 100,000 tuples Ware- 
house relation, resulting in a relation with 100 tuples. A 
grouping and aggregation is performed on this relation, and 
the A2 view relation has 10 tuples. The final join with the 
100 tuples Requests relation results in computing 10 
answers to the query Ql. A total of 130 (10+100+10+10) 
tuples are generated, and 100 tuples are grouped/aggregated. 

Frhus, it can be seen that the use of the theta-semijoin 
equivalence rules considerably reduced the cost of answer- 
ing the query. The size of the intermediate relations gener- 
ated is reduced and the size of the inputs to the join and 
grouping/aggregation operations are reduced. In this case, 
the optimizer would select the relational algebraic expres- 
sion of the query which utilizes the theta-semijoin operator 
as the least costly method of implementing the quer^T^ U 
IL Background and Definitions 

The symbols R (with or without subscripts) are used to 
denote relations, 9 (with or without siibscripts) to denote 
quantifier-free predicates (i.e., conditions), E (with or with- 
out subscripts) to denote relational expressions, attrs(E) to 
denote the attributes of the result of E, and a to denote a tuple 
of attributes. Relations are treated as multisets of tuples, and 
hence the multiset version of relational algebra is used, 

A grouping/aggregations operator is denoted by -¥j, 
where a denotes the groupby attributes and J denotes the 
aggregate operations performed on the groups defmed by the 
groupby attributes. 
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A. Definition of the Multiset S-Semijoin 

The multiset version of the 6-semijoin operator, is 
defined as follows. Given relations Rj and R2, 

where 6(12) denotes the condition 9 with attributes of R2 
replaced by their values from tuple t2. 

10 The definition of 0-semijoin preserves the multiset 
semantics, i.e., the multiplicity of each tuple in the result is 
exactly the same as in R^; each copy of tuple tiCRj is present 
in Ri ^el^2 ^ ^ ^^^'^ exists a tuple l2€R2 such that 

the pair of tuples t^ and t2 satisfies the condition 6. For 

15 example, if the relation R^ (A, B) is the multiset of tuples 
{(1,2), (1,2), (1,4)}, and R2 (C, D) is {(3, 5,), (3, 6), (3, 7)}, 
then R,Xcs^R2-{(l,2), (1,2)}. 

In the multiset relational algebra, 8-semijoin is a derived 
operator, and can be expressed using the 9-join, projection 

20 (jt) and duplicate elimination (6) operators as follows: 



25 



30 



where ►^^gj denotes natural join. 

Some of the 9-semijoin transformation rules described 
make use of functional dependencies present in relations. 
The fiinctional dependencies present in a relation R are 
denoted by F D(R). In addition, the transformations also 
make use of function dependencies implied by conditions 
(such as 9-join or 9-semijoin conditions). For example, the 
condition x-y*y implies the functional dependency {y}-»'X, 
and the condition x-y+z implies the functional dependencies 
{y, z}-»'X, {x, y}->z and {x, z}-*y. The notation F D (9) is 
used to denote the set of all functional dependencies implied 
by condition 9. 

The first step in the optimization is a translation to 
extended relational algebra. The standard translation of SQL 
to extended relational algebra is used. An SQL block of the 
form 



SELECT [DISTINCT] 

FROM 

WHERE 

GROUPBY 



attribute- list 
relation-list 
selection-preds 
groupby- attrs 



is translated to 



50 



If the query does not have a GROUPBY clause, the groupby/ 
aggregation operator is omitted, and if the DISTINCT is 
missing the 6 operator is omitted. 
B. Extended Relational Algebra 

Extended relational algebra expressions E are generated 
by the following grammar. 



60 



Oe(E) 
E®E 



relation name 

multiset selection 
multiset projection 
multiset union 
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E - E multiset difference ^ 

E n E multiset intersection 

EXe^ multiset e-join rm * • ^ t r ^ ' ^ t * r% 

ExeE multiset 6-Bemijoin Tn^ third observation or CoDstraiat Magic (borrowed 

EzDKfeE multiset icft-outcrjoin from Magic Sets) IS that for each of the view relations R^, 

rlS:;S"Te"r iSl used mdeAning v.. he omy tuples of R, that need ^ be 

^j(E) grouping/aggregation Computed are those that can be used in the expression S,. For 

duplicate elimination required tuples are the answers to the set of queries 

represented by (op^, query_JRiX where query_JRi is de6ned 
as 



In keeping with the SQL semantics, multiset union® is 
defined additively, ix., the cardinality of a tuple in the result 

is the sum of its cardinalities in each operand of©. Multiset query_/?i tow^tOvi*^^^ 
versions of intersection and difference also follow the SQL 
semantics. Note that cross-products and natural joins can be 

expressed using 6-joins. ^—^^ required tuples are answers to the queries 

C. Algebraic Summary of Constraint Magic Rewriting represented by (6 ^ A-iji'^^ query_R,-^i), where query_R.vi is 

This section presents a novel formulation, in algebraic defined as 
notation, of the supplementary version of the Constraint 
Magic rewriting, which was originally presented as a source 

to source rewriting of Datalog rules. This formulation qiicry_^,vi = '^«rrn(^A^'^Mo«r*t^i>^'^'^ 
enables us to establish the connection between Constraint 
Magic rewriting and G-semijoin transformations. 

Given a set of view definitions, it is assumed that each Note that the sets of queries on the R/s are in the same 

view is used at most once. Consider a view V defined by: parametrized notation as the original queries on V. Hence, 

the same procedure may be applied recursively to restrict the 

vdrf , ,„ „ , „ , evaluation of the view relations R/s to the relevant tuples. 

' The rewnting term mates when there are no more view 

definitions to be processed. A final phase discards view 

If there is a set of queries on V of the form o^^ (V), where definitions that are no longer reachable from the top-level 

the conditions are of the same form and differ only in the 35 query. 

values of constants occuring in the ij),. Constraint Magic m Transformation (or Equivalence Rules) 

rewriting starts from the top-level queries on V and is *Ti nr. • . ^ n 

• 1 ,. J • 1 4- *u * „ ^ * ♦u A. Basic eSemijoin Transformations: 
recursively apphed on view relations that are used m the 

definition of V. Consider the rewriting of view V. Optimizing SQL queries by making use of 9-semijoins 

The first observation of Constraint Magic rewriting is that ^ involves specifying equivalence rules involving 

all the r^i's can be represented by a parametrized condition O-semijoins, and other operators of the extended multiset 

ip($Ci, . . . , $c^), such that the '\\f/s differ only in the values relational algebra. Given a collection of equivalence rules, a 

for the parameters $cj, l^j^m. This set of queries can be transformational optimizer can be used to enumerate and 

represented by the parametrized condition \p and the query compactly represent the logical search space. In a subse- 

relation Q^ (Sc^, . . . , $c J containing, for each a tuple ^5 quent phase, cost formulas for the operators in the algebra 

with values for the parameters defining are used to efficiently compute cost estimates for different 

Magic Sets rewriting deals with a restricted case where tI? ways of evaluating the query, and choosing the optimal way 

is a conjunction of equality conditions of the form V.a~$cj; (in the logical search space represented) of evaluating the 

the attributes used in this conjunction are the "bound" query 

attributes in the queries. <;n ^n. » c *• u 1 • . u 

™ J L c ^ • ^ XM • • The transformations below require a renammg slep when 

The second observation or Constraint Magic rewntmg is .. ... . .i?. ^- j^ 

that to answer the set of queries on V represented by Q^), P^^^^S conditions through algebraic operations, and when 

it is not necessary to compute all of V. In fact, it suffices ^^^^^^^ expressions. For simphcity of 

instead to use a version V of V, defined as foUows: exposition, this renaming step is ignored. 

j5 1. Introduction of 6-Semijoin: Relational algebra expres- 

w/,^ w ,„ „ , „ ^ , sions generated directly from SQL queries typically do not 

^ '^^ contain the e-semijoin operator (with the exception of using 

6-semijoins only to handle HAVING clauses). Transfonna- 

The use of V is then replaced by V. tions la, lb, Ic and Id show how the 6-semijoin operator 

A novel contribution of this invention is the observation 60 can be introduced into expressions with join, oulerjoin, 

that parts of the semijoin condition can now be pushed into intersection and difference respectively, 
the join expression to filter out tuples from the results of 

intermediate joins, as well as from the relations (R^, Tmnsfomiation 1 a: E2^e, tx, (f^ tx, Eo 
R2, . • • R„) that participate in the join. 

The notation lit' is used to denote the portion of ip that 65 Transfonnation lb: Ei zxe Ej^e^ zxq (E2 xg Ei) 
involves only attributes from R^, . . . , R^ and Q^^. We define 
supplementary relations S,-, i^l as follows: 
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A symmetric result holds for right-outeijoins. 

Transformation Ic: Ei f) ^2 ^ C\{E2 Xsai £"1 ) 

where Ez^j^xr^i denotes the natural semijoin of E2 with Ej. 

Transfomiation Id: £| - £2 ~(^2 x<mn{Eiy=ann{Ei) ^1) 

2. Flushing Selections through Semijoins: 

Transfomiation 2a: cr^j {E\ £2) =* i^Bi i^i )) x^ Ei 

Transformation 2b: Ei Xfi^ /^^^ E2 s i(T&^ {Ex )) £2 

where involves only the attributes in attrs(Ei), 

Transformation 2c: £1 Xtf,Afl2 ^2 = £1 {o-92{Ei)) 

where 02 involves only the attributes in attrs (E^). 

3. Pushing/Introducing Projections: 

Transformation 3a: ffff(£i Xp £2)^C^j(^i)) Xfl ^2 

where 6 does not involve the attributes in attrs(Ei)-a. 

Transformation 3b: £1 £2 = £1 Xp (TTgCfj)) 

where 0 docs not involve the attributes in attrs (£2)-!. 

4. Pushing/Introducing Duplicate Elimination: 

Transfomiation 4a: S[Ey Xg £2) s (<5(£| )) x^ £2 
Transfomiation 4b: £1 x^ £2 o £1 Xj (^(£2)) 

5. Pushing through Union, Intersection and DiflFerence: 

Transformation 5: 

(£it±)£2) xe £3e(£i Xs £3)WCE2 x^ £3) 
Transformation 6: 

(EiPlEz) X9 (El Xs £3)0 (£2 X0 £3) 
Transfomiation 7: 

(£i-£2) X(, £3B(£i x& £3) -(£2 Xfl £}) 

B, Complex 0-Semijoin Transformations: 
1. Pushing e-Semijoin through Joins: 



10 
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Transformation 8b: 

(£1 Xa, £2) Xtf2 £3 = (£t Xf,, £'2) tx^ £3 
where £'2 = £2 txs,nfl2 f^' '^T""" 

A symmetric rule can be derived for pushing a semijoin 
into El using commutativity of 0-joins. 

Transformation 8c: 

£1 X0ir)&2 iEl Xrru, £3)a(£| Xfl, £2) Xs2 £3 

IS 

where 61 involves only the attributes in attrs (Ei)Uattrs (E^), 
and 02 involves only the attributes in attrs(Ei)Uattrs(E3). 
Note that ^Yme ^ equivalent to a cross-product. 
20 2. Pushing 0-Semijoin through Outerjoins: 

Transformation 9: 

(£1 3X9^ £2) X(^ £3 s (£, Xfl^ £3) =Xfl, £2 
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where 62 involves only the attributes in attrs(Ei)Uattrs(E3). 
3. Pushing 0-Semijoin through Grouping/Aggregation: 

Transformation 10(a): a£j(£i) Xg £2 = sFjiEi Xp £2) 

where 9 involves only the attributes in a and attrs (Ej) 
Transformation 10(b): 

£i-a^m=H»in(£i-6>=[(£^l) X j/^j^^ ^j(£j.g^a^E2,„ (£2)] 



where 0 involves only the attributes in Ej.a and attrs (Ej), 
and g(Ei.a, attrs(E2)) is any function all of whose arguments 
are from Ej.a and attrs (E2). Asymmetric result holds for the 
45 case of max with ^. 

4. Introducing Aggregation for Subsumption: Transfor- 
mation 11 shows conditions under which some of the tuples 
in the right operand of a 0-semijoin operator are redundant, 
and a grouping/aggregation operator can be introduced into 
the right operand to reduce its size. 



Tiansformation 11: 
^1 x^(^2.,^^£j,j3£^._^ (£2)3 

^1 X^£2.j^n>3(£'^))s^£2='J Fm=mialE2-y)i£2)) 



55 



Ttansformation 8a: 

(£1 Xflj £2) Xp2 £3 s £1 Xfi^ (£2 x^ £3) 



where g is any fiinction all of whose arguments are from the 
60 attributes E^.xand attrs (E^). Asymmetric result holds for the 
case of min with ^. 
In the case where the function g(E2.x,attrs(Ej)) is a single 

where 02 involves only the attributes in attrs (E2)Uattrs (E3). ^^'^^^ transformation can be applied 

^ . , , 1^.1 f u immediately after Transformation 10b. 

The symmetric rule for the case when involves only the 65 
attributes in atU-s(E,)Uattrs (E3) can be derived using the ^ Simplification/Elimination Transformations: 
commutativity of 0-joins. 1. Splitting/Combining 0-S6mijoins: 
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Transformation 12: 



By transforming the LHS to the RHS, other transformations 
(e.g.. Transformation 10a) may get enabled. By transforming 
the RHS to the LHS, a 6-semijoin can be eliminated. 

2. Simplification; Some of the 6-semijoin transformations 
can generate expressions where some conditions are 
checked more than once. The repeated checks are necessary 
in general, but in some special cases the repeated checks are 
redundant, and the expressions can be simplified by remov- 
ing them. Transformations 13a and 13b can be used to 
eliminate repeated checks, when they are applicable. 

Transfonnation 13a: 

^1 X9iA93 {^2 X(»iA%AP4 ^l) 



where attrs (E^) functionally determine all the attributes in 
62, under the functional dependencies F D(6i)UF D(Ei). 

Transfonnation 13b: 

(^1 X9lA02^93 ^2) X9iA94 ^2 



where attrs(Ei) functionally determine all the attributes in 
O2, under the functional dependencies F D (6i)UF D (E^, 

3. Eliminating 6-Semijoin: Intuitively, a 9-semijoin can 
be rewritten as a join followed by a projection if the join 
condition along with the functional dependencies of the right 
operand of the 6-semijoin guarantee that each tuple of the 
left operand is selected by at most one tuple of the right 
operand. 

This intuition is formally captured by Transformation 14: 



where E2.y is a superkey of £3, and g(attrs(E^)) is a function 
of attributes of E^ that returns a tuple of values with the same 
arity as E2 y. 

D. Complex Derived Transformations: 

Some useful complex transformations can be derived 
using combinations of the transformations described so far. 

Transformation 15: 

(El X9^ £2) Ei s (El Xe^ E3) £2 
Ttansfotmation 16: 

(Ei^EjiEi)) XfljASj ^2 =(£,^'='7(f| Xfl, El)) Xg^f^ El 



where involves only attributes in Ej.a and attrs(E2). 



Transformation 17: 

(£l tXfl El) X*|Afl2A8*3 ^3 = 

5 {{Ex txoj E3) Xe (£2 x<^ E^)) XsjAfliAffs ^3 

where 0^ involves only the attributes in attrs(Ei)Uatlrs(E3), 
and 02 involves only the attributes in attrs(E2)Uattrs(E3). 

10 

Trcinsfomiation 18a: 

^1 xj^iA^i Ei^Ey iXfljASz x^i £1) 
Transfonnation 18b: 

IS 

(£1 Xtf^Afli ^2) X^f^Q^ £3 s (£1 XfljA6>, E'2) XfljAft* ^3 
where £2 = £2 XfliAfl; f^l Xjn,, £3). 

2Q The set of transformations described herein are extensive, 
but not exhaustive; under some special conditions, other 
transformations may be applicable. For instance, in the 
special case that whenever Ej is empty, so is E^, the 
expression ^^^True ^ ^ equivalent to Ej. 
IV. 0-Semijoin Optimization of Complex SOL Queries 

Decision support queries are quite complex, and are often 
defined using view relations (or table expressions), and 
correlated subqueries. When correlated subqueries are 
decorrelated, the resulting decorrelated query usually has 
view relations (equivalently, table expressions), and may 

30 contain outeijoin operators. When the FROM clause of an 
SQL query block includes view relations, traditionally query 
optimizers attempt to "merge" the definitions of the view 
relations into the blocks that use the view relations; join 
order optimization is then performed on each merged block 

35 separately; no inter-block query optimization is performed. 
When the view relations have grouping and aggregation, 
or are defined using SELECT DISTINCT, the view defini- 
tions cannot be straightfonvardly merged, in general, while 
preserving the SQL multiset semantics. Such views are 

40 expected to arise frequently in decision support applications. 
The above-defined 0-semijoin transformations enable 
effective optimization of such complex SQL queries that are 
not optimized well using traditional cost-based optimization 
techniques. The rule-based framework allows optimization 

45 of such queries to be easily integrated with a cost-based 
optimizer. 

A. Optimizing Queries with Aggregation \^ews 
Consider an SQL query Q whose FROM clause includes 

multiple relations to be joined, one of which is a view 
50 relation defined using grouping and aggregation. Such a 

query can be represented in the extended relational algebra 

by the expression. 

C2 = £1 X0 -gFAEi) 

55 

For obvious reasons, the definition of Ej cannot be 
meiged into the definition of Q, to enable join order opti- 
mization. Previous techniques to optimize such expressions 

60 include Magic Sets, Predicate Movearound, and techniques 
that push groupby/aggregation operations through joins. The 
0-semijoin transformations combine (and exteod) the ben- 
efits of both Magic Sets and Predicate Movearound in an 
algebraic framework. 

65 Suppose 0=0^ A02, such that 0j is a conjunction of 
equality conditions of the form E^.a^and E2 a^ and a^eg- Let 
aj denote the set of all such attributes of expression E^ in O^. 
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Magic Sets transformation first defines an auxiliary view: 
Subsequently, the expression Q is transformed to 
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B. Optimizing Queries with SELECT DISTINCT Views 
Consider an SQL query Q whose FROM clause inchides 
multiple relations to be joined, one of which is a view 
relation defined using SELECT DISTINCT. Such a query 
5 can be represented in the extended relational algebra by the 
expression 



El xe jFyiiEz xe^ QE^) 

The main limitation of Magic Sets is that it cannot use 
non-equality conditions in 6 for optimizing the evaluation of 
E2. Variants of Magic Sets, such as Extended Magic Sets, 
can additionally use inequality conditions of the form 
E2.a2<c, where c is a constant, in some limited cases. 

The effect of Predicate Movearound can also be under- 
stood algebraically. Suppose the join condition 0 in Q is 
equivalent to BaB^, such that 63 is an arbitrary condition 
involving attributes only of E2 that are in g and some 
restricted kinds of attributes from J. Then the expression Q 
is transformed to 

The main limitation of Predicate Movearound is that it 
cannot use conditions involving attributes from both E^and 

The 0-semijoin transformations approach combines the 
benefits of both Magic Sets and Predicate Movearound, and 
can push arbitrary conditions in 6 (not just equality 
conditions) involving attributes of both and E2, for 
optimizing the evaluation of Ej. First, the 6-semijoin opera- 
tor is introduced, based on the join condition 0, using 
Transformation la, and Q is transformed to 

£•1 X9 {jEj{Ei) Xe £,) 

Then, the B-semijoin is pushed through the groupby/ 
aggregation operator, to restrict the number of tuples of Ej 
that participate in the grouping/aggregation to those that 
could subsequently join with Ej. The nature of the semijoin 
condition 9 determines the result. When 6 involves only 
attributes in g and attrs(E^), Transformation 10 a can be used 
to transform Q to 

El X9 jEjiEi Xe^Ei) 

The intuition here is that, for each group of E^, either all 
the tuples will be selected by (E^^e^iX or none will. The 
tuple in the result of the F operator generated from each 
group will correspondingly be selected or not. 

When 6 involves results of the aggregation, the 
6-semijoin operator cannot be puished through aggregation 
in general. Transformation 10b identifies cases when it is 
possible to push the 0-semijoin operator through jFj. 
Further, when E2 is a complex expression, e.g., a join 
expression, the semijoin can also be pushed into E2 to 
optimize the evaluation of Ej. 

Note that both equality and non-equality conditions, 
involving attributes of both E^ and E2, have been used for 
optimizing the evaluation of E^. More importantly, the 
transformation rules used are algebraic equivalences, and a 
cost-based optimizer could choose from among the different 
(equivalent) algebraic expressions generated. 



G = £i xg S(E2) 

10 

If the multiplicities of tuples is important for Q (e.g., it is 
used in an aggregation), and Ej can have duplicates, the 
definition of view E2 cannot be merged into the definition of 
Q. Previous techniques to optimize such expressions include 

15 the technique of pushing the duplicate elimination (8) opera- 
tor through joins, by modeling the 6 operator as a groupby 
on all the arguments, with no aggregation performed. This 
technique is not always applicable, whereas the 9-scmijoin 
transformations can always be used here as described below. 

20 Moreover, the transformations can be combined with the 
optimizations of this technique cleanly within a 
transformation-based optimizer, so that the cheaper one gets 
chosen when both are applicable. 

The effect of the 0-semijoin transformations on Q can be 

25 understood in two steps. In the first step, a new 0-semijoin 
operator is introduced, based on the join condition 0, using 
Transformation la. In the second step, the 6-semijoin is 
pushed through the duplicate elimination operator, using 
Transformation 4a, resulting in the expression 

30 

El X0 S{E2 Xe El) 

The effect is to restrict the tuples of E2 00 which the 
expensive duplicate elimination needs to be performed to 
those that would subsequently join with E^. 

C. Optimizing Queries with Correlated Subqueries and 
Outerjoins 

Consider an SQL query Q that includes a correlated 

^ subquery. For improving set-orientedness of evaluation, the 
query may be decorrelated. The resulting decorrelated query 
has view relations (or table expressions) in the FROM 
clause, and may contain outerjoin operators, e.g., the left- 

^5 outerjoin operator This is required by the semantics of 
correlated subqueries, especially in the presence of aggre- 
gate functions in the correlated subquery. Performing join- 
order optimization in the presence of outerjoins can be quite 
difficult, and traditional query optimizers often perform 
poorly on such complex queries. 

The 0-semijoin transformations can be used to effectively 
optimize queries with outerjoin operators. In particular, the 
transformations can be used in conjunction with any tech- 
nique for decorrelating a correlated subquery, to reduce the 

55 amount of irrelevant computation performed in evaluating 
the decorrelated view relation. 

Consider an SQL query Q whose ETIOM clause includes 
multiple relations, at least one of which is a view relation, 
and some of the relations have to be left-outerjoined. Such 

50 a query can be represented in the extended relational algebra 
by the expression 

65 

This query can be optimized, using Transformation lb, to 
obtain the expression: 
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El iXe {El Xg El) 

When E2 is a complex expression, Q can be further 
optimized using the techniques described herein. 

An alternative decorrelation technique in the prior art, 
Magic Decorrelation, combines the advantages of Magic 
Sets with deconelation. In effect, this technique computes an 
auxihary view relation that contains the bindings that are 
used to invoke the correlated subquery, and the left-outerjoin 
is performed with the auxiliary relation as part of the 
definition of the decorrelated view relation. Magic Decor- 
relation can only use equality conditions; with non-equality 
conditions, using outerjoins would result in incorrect mul- 
tiplicities. 

Outerjoins can also be directly specified by the user in the 
FROM clause of the SQL query. When such outerjoins are 
specified in a view relation, 0-semijoin transformations can 
be used to optimize the evaluation of the view relation. 

The e-semijoin transformations do not have to be made 
heuristicaliy; the transformation rules used are algebraic 
equivalences, and an optimizer could use cost estimates to 
choose from among the different (equivalent) algebraic 
expressions generated. 
V. Cost Model for 6-semijoin 

The costing phase of a transformational optimizer uses 
cost formulas for the operators in the algebra to eflSciently 
compute cost estimates for the different ways of evaluating 
the query. This section describes how cost formulas for the 
9-semijoin operator can be determined, assuming the avail- 
ability of cost formulas for the other operators in the multiset 
relational algebra, in particular the 0-join. 

The 6-semijoin operator is a derived operator in the 
extended multiset algebra, and can be expressed using the 
6 -join, projection (jt) and duplicate elimination (6) opera- 
tors; implementing the 6-semijoin operator in this fashion is, 
however, quite inefficient. 

The 6-semijoin operation Ri^aR2 can be efiEciently 
implemented using minor changes to join techniques such as 
hash joins and index joins. One implementation treats the 
left operand Ri of the 6-semijoin as the "outer" relation in 
the join technique. For each tuple in the outer relation Rj, 
instead of joining it with each matching tuple in the inner 
relation R^, the tuple in R-i can be returned as soon as a 
match is found. Sort-merge joins can similarly be adapted to 
implement e-semijoins if the join condition is an equijoin. 

An alternative implementation treats the right operand R2 
of the e-semijoin as the "outer" relation in the join tech- 
nique. For each tuple in the outer relation R^, all matching 
tuples in the inner relation R^ are returned. If a tuple in R^ 
is already in the result as a consequence of matching a 
different Rj tuple, it is not added to the result; an efficient 
implementation requires an index on the result of the 
6-semijoin, in general. When the 0-semijoin condition 
involves an equijoin with a superkey of Rj, it is guaranteed 
that a tuple in Rj matches at most one tuple in Rj; no index 
on the result of the 6-semijoin is required in this case. Using 
R2 as the outer relation in the join algorithm as above is 
beneficial when, e.g., the right operand R2 of the 0-semijoin 
is smaller than the left operand R^,. 

The cost formulas for the different join techniques are 
easily modified to derive cost formulas for the different ways 
of implementing the 8-semijoin operator. 

Approximate versions of the 9-semijoin operation can be 
efficiently implemented using Bloom filters, which are fixed 
size bit vectors. When the 6-semijoin condition is an equal- 
ity condition, standard hashing techniques can be used to 
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mark bits. For inequality conditions, possible options for 
marking bits include partitioning the domain of an attribute 
value and using a bit to represent each partition. This option 
is particularly useful when the 0-semijoin condition has been 

5 derived from a band join. Cost formulas for using Bloom 
filters can hence be used to derive the cost formulas for 
approximate versions of the 8-semijoin operator. 

Several of the equivalence rules involving the 9-semijoin 
operator (e.g.. Transformation la) introduce common sub- 
expressions. The decision on whether to replicate the com- 
putation of the subexpressions, or to materialize the rcsuU of 
the subexpression and reuse the results, should be based on 
the relative cost estimates of the two approaches and is left 
to the cost-based optimizer. A useful heuristic is to materi- 
alize the results of common subexpressions and reuse the 
results wherever required. If the optimizer cannot explicitly 
deal with common subexpressions, it is straightforward to 
change the formulation of the equivalence rules to replace 
the subexpressions by a (temporarily) materialized relation, 
thereby performing common subexpression elimination at 

20 the level of the equivalence rules. 

A. Heuristics for Cost-Based Optimization 

A cost-based transformational optimizer extended with 
the 6-semijoin operator and our transformation rules would 
generate a superset of the query plans that the un-extended 

25 optimizer would generate. While this could result in a plan 
with a cheaper (estimated) cost being chosen by the 
extended optimizer, it could considerably increase the cost 
of query optimization by increasing the search space. This 
section discusses some heuristics that reduce the generated 

30 search space. It is desirable that these heuristics satisfy the 
following criteria: 

1. The heuristics should not prune away algebraic expres- 
sions that optimize certain classes of SQL queries (for 
example, queries with aggregation views, queries with 

35 SELECT DISTINCT views, queries with correlated subque- 
ries and outerjoins) that are not optimized well using tradi- 
tional join-ordering optimizers. 

2. For SQL queries that are optimized well using tradi- 
tional join-ordering optimizers, the search space should not 

40 increase. 

The following heuristics resU-ict the applicability of the 
transformation mlcs that introduce the 9-semijoin operator. 

First, transformation la should be appUed only when E2 
is of the form 6(E3 ), or of one of the forms ) or jc^ 

45 rFj(E3 )). 

The first case arises when the view in the FROM clause 
has a SELECT DISTINCT, and the latter cases arise when 
the view in the FROM clause has grouping and aggregation. 
Second, transformations Ic and Id should be used only 

5Q when the size of the result of is smaller than that of E2 
on an order of magnitude (e.g., ten times or more smaller). 

The transformation rules that introduce the 9-scmijoin 
operator in the presence of outerjoins should be retained. 
Tliis enables the optimization of the decorrelated view 

55 relations obtained from correlated subqueries. 

These heuristics satisfy the above desired criteria. In 
particular, if the algebraic expression obtained firom the 
query only has occiu^rences of the selection operator Oq, the 
projection operator Jt;, and the join operator the transfor- 

60 mations do not add to the search i^ace at all; traditional 
optimization techniques are adequate for such queries. Also, 
using these heuristics can considerably restrict the search 
space over unrestricted use of the transformation rules, for 
general SQL queries, since a typical query has far fewer 

65 occurrences of the grouping/aggregation operator, the dupli- 
cate elimination operator, and the outeijoin operators than it 
has of the join operator. 
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VI. 6-semijoin and Constraint Magic Rewriting 

Magic rewritings optiniize database queries by defining a 
set of auxiliary magic (or query) relations, that are used as 
filters to restrict computation of the query. Below is a 
derived transformation rule that captures the key intuition of 
Constraint Magic rewriting for a single join. Applying this 
transformation rule repeatedly on a sequence of joins has an 
efifect similar to Constraint Magic rewriting, for a single 
block SQL query. A heuristic for applying the method of 
SQL queries that use view relations in addition to database 
relations is also included. The heuristic simulates the behav- 
ior of Constraint Magic rewriting on such queries. 

A. CM Transformation Step Using 9-semijoin 

The following transformation rule captures the basic step 
of Constraint Magic rewriting: 

where 02 involves only the attributes in attrs (Q)U attrs(Ei), 
and and are defined below: 

E[ =Ei Xe^ Q 

The above transformation is referred to as the Constraint 
Magic Transformation (CMT) Step and is shown pictorially 
as an algebraic equation graph in RG. 4. 

The CMT step is a derived transformation rule, and 
follows from an application of Transformation 12, commu- 
tativity of 8-joins, an application of Transformation 8a (to 
generate E^), Transformation 8b (to generate the basic 
structure of E^), Transformation 2c, and finally the trans- 
formation: crej(Q*<7y„^El)=Q% E^, which defines the 
9-join. As with Transformations l8a and 18b, weaker ver- 
sions of the join and semijoin conditions can be used to 
define El and E2 in the CMT step. 

B. Connection of CMT Step and Constraint Magic Rewrit- 
ing 

The expressions defining E^ and Ej in the CMT step 
capture the essence of Constraint Magic rewriting. For 
example, suppose there is a set of queries Q on the result of 
a join of relations E^ and E2 In Constraint Magic rewriting 
of this join, the "query relation" (also called the "magic 
relation") Q is first used to restrict computation to Ej to 
tuples relevant to Q. Then the set of Ej tuples thus restricted 
are used along with the query Q to restrict computation of 
£2- This strategy is exactly captured in the CMT step. 

More formally, the connection can be established as 
follows. Consider a view defined as: 



12,144 

20 

etrized condition 62. View V is then replaced by view V 
defined below: 

where E"2 is the result of the supplementary Constraint 
Magic rewriting on E^, with the query relation Sj, and the 
parametrized condition O^aBj. 

10 The main difference between the Constraint Magic rewrit- 
ing and the CMT step on a single join is that Constraint 
Magic rewriting uses 6-joins rather than 9-semijoins. 
Although the final expression using 6-semijoin is more 
complex than the definition of V generated by Constraint 

15 Magic rewriting, the added complexity is required to pre- 
serve the multiset semantics. 

C, CM Transformation of an SQL Block Using 9-semijoin 
The algebraic expression V generated by transforming a 
single block SQL query is of the form: 

20 

Given a set of queries on V, denoted by V^^Q, the following 
25 sequence of transformations can be applied to V^^Q. First, 
identify the strongest subset of tj), denoted by tlJ„, that can be 
pushed through the groupby/aggregation operator. If the 
original query did not use GROUPBY, is the same as ip. 
Then,^^ Q can be pushed inside the projection operator, 
30 using Transformation 3a (from RHS to LHS), and the 
groupby/aggregation operator, using either of Transforma- 
tions 10a or 10b, whichever is applicable, to obtain: 

VhT^^ijsFjjii... Ri xo, /fi)... x^^^^ RJ Q)) 

Finally, the CMT step can be repeatedly apphed on the 
expression 

40 C(...(/fi X0^ ffz)... Xfl„_j R„) Q) 

as described below. First define S^, i^l, as follows: 



Also, let t|>^, i<n denote the strongest subset of t|j;^ ^ that uses 
only attributes of Q and S,-, and i<n denote the rest of 
ip^^i. The first application of the CMT step transforms 



with a query relation Q, and a parametrized condition 62A93 
where O2 involves only the attributes in attrs(Q)U attrs(Ei). 
(This is the same as the LHS of the CMT step.) The 
Supplementary Constraint Magic rewriting first defmes 
supplementary relation S^ below: 

s/t-^G x^ e: 

where E"j is the result of supplementary Constraint Magic 
rewriting of Ej with the query relation Q and the param- 



where = x^^, G) and 

60 

Now, consider S'ViJ the 9-semijoin can be pushed into 
the definition of S„.i in exactly the same manner as above. 
Thus the CMT step is applied on each S,-, n^i^2. Note that 
there are two occurrences of S'^^, i.e., it is a common 
65 subexpression of two expressions. By using labeled 
expressions, the cost of optimiziDg and evaluating the 
expression twice is avoided. Using labeled expressions is 
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very important to avoid an exponential blow up as we go 
down from S„ to S^. 

D. Heuristic CM Transformation of SQL Queries With 
Multiple Views 

The 6-semijoin transformation of an SQL query block is 5 
performed. This block may contain uses of view relations, 
and after the transformation the use of a relation R^. may have 
a semijoin of the form R.x^^ Q.., or R.^^p^CQ^^^/^.a). Let 
denote the entire semijoin expression involving R,-. If R,- is 
a view relation, a specialized version R', of the view defi- 
nition of Rf, with the semijoin pushed into it, can be created 
recursively using the 6-semijoin transformation of the SQL 
block defining R,-. Finally, if all-of p,- can be pushed into the 
view definition of R,-, then E,- is replaced by R'; else only R^ is 
in E; is replaced by R',-. 

The relationship between the CMT step and Constraint 
Magic rewriting discussed earlier for a single join also 
carries over to the case of views, and to queries defined using 
multiple views. 

Thus for SQL queries, the effect of Constraint Magic 
rewriting is obtained as a special case of the 6-semijoin 
transformations, in particular by using the CMT step. If the 
full space of equivalent expressions is explored, Constraint 25 
Magic rewriting will be examined as an option, and the 
cheapest expression in the search space will be chosen. 

The foregoing merely illustrates the principles of the 
invention. For example, the features of the invention need 
not only be used for query optimization. Indeed, they may be 
utilized in any procedure requiring the determination of 
equivalent relational algebraic expressions. 

It will thus be appreciated that those skilled in the art will 
be able to devise numerous and various alternative arrange- 
ments which, although not explicidy shown or described 
herein, embody the principles of the invention and are within 
its spirit and scope. 

We claim: 

1. A method comprising the steps of ^ 
receiving a computer programming language database 

system query; 

generating at least one relational algebraic expression in 
response to said query by employing at least one of a 
collection of equivalence rules involving the multiset 45 
version of the relational algebraic theta-semijoin opera- 
tor; 

selecting at least one of said relational algebraic expres- 
sions; and 

accessing said computer programming language database so 
based on said selected expression. 

2. The method of claim 1 wherein said selecting com- 
prises the step of estimating processing costs for at least one 
of said expressions, 

3. The method of claim 2 wherein said selecting is based 55 
on said estimated processing costs. 

4. The method of claim 1 wherein said computer pro- 
gramming language is SQL. 

5. The method of claim 4 wherein said query is selected 
from the group consisting of queries with aggregation views, 60 
queries with SELECT DISTINCT views, and queries with 
correlated subqueries and outerjoins. 

6. The method of claim 4 wherein said query is selected 
from the group consisting of queries with views and queries 
with subqueries. 65 

7. The method of claim I wherein said query is a decision 
support query. 



8. The method of claim 4 wherein said query is at least 
one magic set rewriting of a second computer programming 
language database query. 

9. A computer programming language query optimizer 
comprising 

means for receiving a computer database language query; 
and 

means for generating at least one relational algebraic 
expression by employing at least one equivalence rule 
involving the multiset version of the relational alge- 
braic theta-semijoin operator, said expression being 
equivalent to said query. 

10. A computer processor programmed to perform method 
steps for optimizing an SQL query, said method steps 
comprising 

generating at least one relational algebraic expression by 
employing at least one equivalence rule involving the 
muhiset version of the relational algebraic theta- 
semijoin operator, said expression equivalent to said 
query; and 

performing cost-based algebraic optimization of said 
algebraic expressions, said cost-based algebraic opti- 
mization using at least one cost formula for said 
theta-semijoin operator. 

11. A program storage device embodying a computer 
program to perform method steps comprising: 

receiving an SQL query; and 

generating at least one relational algebraic expression by 
employing at least one equivalence rule involving the 
multiset version of the relational algebraic theta- 
semijoin operator, said expression equivalent to said 
SQL query. 

12. A storage device embodying a collection of equiva- 
lence rules involving the multiset version of the relational 
algebraic theta-semijoin operator, said rules input to soft- 
ware employing relational algebraic expressions, said 
expressions responsive to a computer language database 
query. 

13. A method of optimizing a computer programming 
language query, the improvement comprising the steps of 

receiving said query, 

generating a search space, said search space comprised of 
at least one relational algebraic expression equivalent 
to said query, said expression involving the multiset 
version of the theta-semijoin operator; and 

optimizing said query by estimating cost for said expres- 
sion. 

14. A method of retrieving information firom a computer 
database comprising the steps of 

receiving a computer database language query; 
translating said query into a first relational algebraic 
expression; 

generating at least one relational algebraic expression in 
response to said first relational algebraic expression by 
employing at least one equivalence rule involving the 
multiset version of the theta-semijoin; 

generating a cost estimate for each of said expressions by 
employing at least one cost formula for said theta- 
semijoin operator; 

selecting one of said expressions, said selection based on 
said generated cost estimates; and 

employing said selected expression to retrieve said infor- 
mation from said database. 
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15. A method comprising the steps of 

receiving at least a first and a second computer database 

language queries; 
generating at least one relational algebraic expression for 

each said query by employing at least one equivalence 

rule involving the multiset version of the theta-semijoin 

operator; and 

determining equivalency of any of said generated rela- 
tional algebraic expressions from said first query to said 
generated relational algebraic expressions of said sec- 
ond query. 

16. An apparatus for determining equivalence of SQL 
queries, comprising 

means for accepting at least two SQL queries; and 
means for generating at least one relational algebraic 
expression by employing at least one equivalence rule 
involving the miiltiset version of the theta-semijoin 
operator, said generated expressions equivalent to 
respective queries. 

17. A method comprising the steps of 

receiving a computer programming language database 
system query; 

generating at least one relational algebraic expression in 
response to said query by employing at least one of a 
collection of equivalence rules involving the multiset 
version of the relational algebraic theta-semijoin opera- 
tor; 

selecting at least one of said relational algebraic expres- 
sions; and 

accessing said computer programming language database 
based on said selected expression, said multiset version 
defined as 

where Rl and R2 denote relations, and Bit^) denotes the 
predicate 6 with attributes of replaced by their values 
from tuple t^. 

18. The method of claim 17 wherein said collection 
includes equivalence rule 

£, X0 E2 = Ei Xe (Ez Xe ^0 

where E (with or without subscripts) denotes relational 
expressions, 6 (with or without subscripts) denotes 
predicates, denotes the 9-join operator, and denotes 
the 0-semijoin operator. 

19. The method of claim 17 wherein said collection 
includes equivalence rule 
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(Ei cxffj ^2) Xtfj £j B (El X9^ E'2) 1x9^ £3 
where = ^2 Xei/s^ (^1 XTne £"3). 

5 

E (with or without subscripts) denotes relational 
expressions, 8 (with or without subscripts) denotes 
predicates, % denotes the 9-join operator, and denotes 
jQ the 6-semijoin operator. 

20. The method of claim 17 wherein said collection 
includes equivalence rule 

a£j(£i) Xg Ezm^FjiEi Xg Ej) 

15 

where E (with or without subscripts) denotes relational 
expressions, 8 (with or without subscripts) denotes 
predicates, % denotes the 9-join operator, denotes the 
20 9-semijoin operator, denotes a grouping/ajgregation 
operator, and 9 involves only the attributes in a and attts 

21. A storage device embodying a collection of equiva- 
lence rules involving the multiset version of the relational 

25 algebraic theta-semijoin operator, said rules input to soft- 
ware employing relational algebraic expressions, said 
expressions responsive to a computer language database 
query, said multiset version defined as 

where Rl and R2 denote relations, 9(t2) denotes the predi- 
cate 9 with attributes of R2 replaced by their values from 
mple tj. 

22. A storage device embodying a collection of equiva- 
lence rules involving the multiset version of the relational 
algebraic theta-semijoin operator, said rules input to sofl- 

^ ware employing relational algebraic expressions, said 
expressions responsive to a computer language database 
query, said rules including rule 

El Xb Ez^Ei Xg (El xg El ) 

45 

where E (with or without subscripts) denotes relational 
expressions, 9 (with or without subscripts) denotes 
predicates, «q denotes the 8-join operator, and denotes 
50 the 9-semijoin operator. 

« * * * « 
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